-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[python-package] stop relying on string concatenation / splitting for cv() eval results #6761
Conversation
metric_type[key] = one_line[3] | ||
cvmap.setdefault(key, []) | ||
cvmap[key].append(one_line[2]) | ||
return [("cv_agg", k, float(np.mean(v)), metric_type[k], float(np.std(v))) for k, v in cvmap.items()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This, removing this "cv_agg"
string literal, is the key change... everything else flows from that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks a lot for clear refactoring!
Just some minor suggestions.
Co-authored-by: Nikita Titov <[email protected]>
…into python/remove-cv-agg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thank you for working on this and for taking my comments into account!
Contributes to #6748
There are a few activities where
lightgbm
(the Python package) needs to inspect the output of one or more evaluation metrics on one or more datasets.For example:
For
train()
and other APIs that end up using it, it tracks those using a list of tuples like this (pseudocode):cv()
does something similar. However, its "metric value" is actually a mean of such values taken over all cross-validation folds. Because multiple values are being aggregated, it appends a 5th item with the standard deviation.Some code in
callbacks.py
needs to know, given a list of such tuples, whether they were produced by cross-validation or regulartrain()
.To facilitate that while still somewhat preserving the schema for the tuples, the
cv()
code:"cv_agg"
to the beginning of the tupleSo e.g.
("valid1", "auc", ...)
becomes("cv_agg", "valid1 auc", ...)
. That happens here:LightGBM/python-package/lightgbm/engine.py
Lines 580 to 592 in 480600b
Every place dealing with such tuples then needs to deal with that, including splitting and re-combining that second element. Like this:
LightGBM/python-package/lightgbm/callback.py
Lines 416 to 418 in 480600b
This proposes changes to remove that, so that the
cv()
andtrain()
tuples follow a similar schema and all the complexity of splitting and re-combining names can be removed.It also standardizes on the names from #6749 (comment)
Notes for Reviewers
This change should be completely backwards-compatible, including with user-provided custom metric function. The code paths here are well-covered by tests (as I found out from many failed tests while developing this 😅 ).